Indian Standard
METHODS OF REGRESSION AND CORRELATION
(Second Revision)

ICS 03.120.30

© BIS 2003

BUREAU OF INDIAN STANDARDS
MANAK BHAVAN, 9 BAHADUR SHAH ZAFAR MARG
NEW DELHI 110002

November 2003    Price Group 7

Statistical Method for Quality and Reliability Sectional Committee, MSD 3

FOREWORD

This Indian Standard (Second Revision) was adopted by the Bureau of Indian Standards, after the draft finalized by the Statistical Method for Quality and Reliability Sectional Committee had been approved by the Management and Systems Division Council.

The study of the relationship between two variables is of fundamental importance in industry. For example, in the building industry, while studying the properties of cement, it may be necessary to estimate the effect of curing time on the compressive strength. In such problems, where one variable is of particular interest for studying the effect of the other variable on it, the concept of regression is quite useful. The regression technique is also helpful for the purpose of prediction.

In some problems, the relationship between two variables may be of great interest. For example, in the case of steel, one can study tensile strength by using a hardness test, as the latter has a strong relationship with the former. The determination of the extent of relationship between two variables leads to the concept of correlation.

This standard was originally published in 1974 to cover the statistical methods of regression and correlation in the case of two variables. It was revised in 1995 to include the concept of 'scatter diagram' more elaborately. In view of the experience gained with the use of the standard over the years, it was felt necessary to revise it further.
In the revised version, the following changes have been made:

a) A table which gives the values of correlation coefficient (r) for different selected sample sizes has been included so that the calculated sample correlation coefficient may be compared directly with the tabulated value to test whether the population correlation coefficient is zero or not,
b) Confidence limits for the population regression line, with an example, have been included,
c) Many editorial corrections have been incorporated, and
d) The concepts at many places have been elaborated for better understanding.

The composition of the Committee responsible for the formulation of this standard is given in Annex F.

IS 7300 : 2003

Indian Standard
METHODS OF REGRESSION AND CORRELATION
(Second Revision)

1 SCOPE

This standard covers the statistical methods of linear regression and correlation in the case of two variables. The computations have been illustrated with examples.

2 REFERENCES

The following standards contain provisions which, through reference in this text, constitute provisions of this standard. At the time of publication, the editions indicated were valid. All standards are subject to revision and parties to agreements based on this standard are encouraged to investigate the possibility of applying the most recent editions of the standards indicated below:

IS No.                  Title
6200 (Part 1) : 1995    Statistical tests of significance: Part 1 t-, Normal and F-tests (second revision)
7920 (Part 1) : 1994    Statistical vocabulary and symbols: Part 1 Probability and general statistical terms (second revision)
8900 : 1978             Criteria for the rejection of outlying observations
9300 (Part 1) : 1979    Statistical models for industrial applications: Part 1 Discrete models

3 TERMINOLOGY

For the purpose of this standard the definitions given in IS 7920 (Part 1) shall apply.

4 BASIC CONCEPTS

4.1 Scatter Diagram

4.1.1 The scatter diagram is useful to know the presence or the nature of the relationship between two variables, if any. The relationship can be a cause and effect relationship, a relationship between one cause and the other, or a relationship between one effect and the other.

4.1.2 Scatter diagram can even be used by the operators to find the relationship between two variables, if any. This may lead to taking appropriate actions for quality improvement.

4.1.3 A scatter diagram is prepared by plotting the paired data in an X-Y plane. It is desirable to have more than 30 pairs of data. Of the two variables, one is said to be independent and the other dependent, and it is usual to regard the independent variable as x and the dependent variable as y. Since the range of data varies widely, an origin of zero is sometimes inconvenient for preparing a well-balanced scatter diagram. The data ranges are suitably presented on convenient scales so that the spread is close to a square and large enough for individual points to be perceived.

4.1.4 The problem of outliers is encountered in the actual preparation of scatter diagrams. Outliers are points that lie far away from the rest of the data set. If there are few outliers, they should be eliminated from the data. For guidance on the criteria for rejection of outliers, reference is invited to IS 8900. If there are many (generally more than 25 percent) outliers, the causes for the same should be investigated and corrective action taken. Thereafter, fresh data needs to be collected for plotting the scatter diagram.

4.1.5 Interpretation of a Scatter Diagram

When a scatter diagram is prepared, it is important to interpret it accurately and take necessary measures. For this purpose, the scatter diagram should be carefully observed for the relationship between two variables. The interpretation of the scatter diagrams is explained as follows:

a) Positive relationship -- In a scatter diagram, if y increases with increase in x, then the relationship is said to be positive. When the points are close to a straight line [see Fig. 1 (a)], the relationship is called a positive linear relationship. Under such conditions control on y (the dependent variable) can be achieved by exercising control on x (the independent variable).
b) Negative relationship -- In a scatter diagram, if y decreases with increase in x, then the relationship is said to be negative [see Fig. 1 (b)]. In this case, similar interpretation as given for (a) holds good.
c) Weak relationship -- Sometimes the relationships may not be as clearly evident as in (a) or (b) [see Fig. 1 (c)]. Further investigations may be required to find out the reasons, if any, for the wider scatter. Possibly one factor alone is not sufficient to explain the relationship fully or there could be wide measurement errors. The relationship may not be useful for control purposes in such a situation.
d) No relationship -- In a scatter diagram [see Fig. 1 (d)], no relationship can be noted between x and y. If the presence of relationship is expected on technological considerations, the causes/effects may be examined from other viewpoints. In such a situation the possibility of stratifying the data may also be looked into [see 4.1.5 (e)].
e) Relationship revealed by stratification -- The scatter diagram [see Fig. 1 (e)] shows no relationship at a glance, but if the data is classified into some different groups a relationship may be possible. In this diagram, the presence of relationship can be confirmed definitely by stratifying the data into three groups marked with •, ▲ and ×.
f) Non-linear relationship -- In a scatter diagram there may be a relationship between x and y which is non-linear. For example, in Fig. 1 (f), y increases with an increase in x until a certain point, but decreases with an increase in x beyond that point. Such a relationship is called a non-linear relationship and can be treated otherwise. In such a situation, it is convenient to locate the optimal combination for x and y.
g) Insufficient data range -- When attention is paid only to the points marked with ▲, there seems to be no relationship between x and y, as shown in Fig. 1 (g), but a positive linear relationship is noted when points are observed in a little wider range. Accordingly, it is necessary to examine carefully the appropriateness of the range of x even when no relationship is suggested in the diagram prepared for the first time.

FIG. 1 VARIOUS SCATTER DIAGRAMS [panels (a) to (g), each plotting y against x]

4.2 Regression

Regression deals with situations when one variable is dependent on the other variable. For example, the two variables may be the quantities of carbon steel and alloy steel produced from the same raw material or charge, elongation of boiler plate and the amount of tension applied, amount of rainfall and the yield of a crop, and so on. Of the two variables, one is independent (generally measurable) and the other is dependent (desired to be controlled). Thus, if the production of alloy steel depends on the production of carbon steel, the quantity of carbon steel produced could be considered as the independent variable and that of alloy steel as the dependent variable.

4.3 Correlation

Correlation deals with the relationship between two factors or variables. The degree or intensity of the linear relationship is measured by the correlation coefficient. It may be mentioned that in the study of correlation, it is not the intention to find the effect of one variable over the other as in the case of regression analysis, but to find the degree to which the variables vary together owing to influences which affect both of them. However, the mere existence of a high value of the correlation coefficient is not necessarily indicative of an underlying relationship between the two variables. Such a value can at times be purely accidental, the two variables having no connection whatsoever. In such cases, the correlation is said to be spurious.

4.4 Before carrying out any regression or correlation study, it is desirable to look at the scatter diagram to locate the outliers, if any, and eliminate them.

5 REGRESSION ANALYSIS

5.1 Regression Coefficient

In a scatter diagram of the type described in 4.1.5 (a) or (b), a straight line of the form $y = a + bx$ could be fitted to the observed values, where y is the dependent variable and x the independent variable. The quantity a in the above equation represents the value of y when x = 0, and b denotes the slope of the line and is known as the regression coefficient, which may be negative or positive depending on the orientation of the line with respect to the axes. Physically, b indicates the rate of increase or decrease in the value of y for unit increase or decrease in the value of x. The regression line is also used for prediction purposes. Normally, extrapolation is not recommended, and when necessary, it should be used cautiously.

5.1.1 The relationship of the type $y = a + bx$ encountered in the regression analysis is not generally reversible and is based on the status of the variables concerned. Therefore, this type of relationship should not be used for predicting x for given y. However, mathematically it is possible to find a relationship of the type $x = a' + b'y$, and then the two regression lines intersect at the point $(\bar{x}, \bar{y})$ in the x, y plane.

5.2 Method of Calculation (Ungrouped Data)

5.2.1 Let there be n pairs of observations for x and y corresponding to the items in the sample. For fitting the regression line the following expressions are then calculated:

a) Average of x: $\bar{x} = \frac{1}{n}\sum x$
b) Average of y: $\bar{y} = \frac{1}{n}\sum y$
c) Corrected sum of squares for x: $\sum (x - \bar{x})^2 = \sum x^2 - [(\sum x)^2 / n]$
d) Corrected sum of squares for y: $\sum (y - \bar{y})^2 = \sum y^2 - [(\sum y)^2 / n]$
e) Corrected sum of products: $\sum (x - \bar{x})(y - \bar{y}) = \sum xy - [(\sum x)(\sum y) / n]$

NOTE -- A suitable proforma as given in Annex A may be helpful in the above computations.

5.2.2 From the above quantities the regression coefficient b or b' is calculated as:

$b = \dfrac{\text{Corrected sum of products}}{\text{Corrected sum of squares for } x}$

$b' = \dfrac{\text{Corrected sum of products}}{\text{Corrected sum of squares for } y}$

Also the constant a or a' of the regression equation is obtained as:

$a = \bar{y} - b\bar{x}$

$a' = \bar{x} - b'\bar{y}$

5.2.3 When the regression model is not of the linear type and involves powers or exponentials, the model may be reduced to the linear type with the help of the logarithmic transformation. Thereafter, the fitting of the regression line is exactly similar to the one explained in 5.2.2.

5.2.4 Example

Table 1 gives the Brinell hardness number and the tensile strength (expressed in megapascals) for 15 specimens of cold drawn copper. Consider Brinell hardness number as the independent variable (x) and tensile strength as the dependent variable (y). It is intended to fit a regression line to the data.

5.2.4.1 Plotting the data given in Table 1 as a scatter diagram, wherein the Brinell hardness number is measured along the X-axis and the tensile strength along the Y-axis, Fig. 2 is obtained, from which the linear trend of the points is self-evident. For the sake of better understanding, the regression line applicable to the data is also drawn in Fig. 2.

Table 1 Hardness and Tensile Strength Values of Cold Drawn Copper
(Clauses 5.2.4, 5.2.4.1 and 5.2.4.2)

Sl No.   Specimen No.   Brinell Hardness x   Tensile Strength y (MPa)
(1)      (2)            (3)                  (4)
i)       1              104.2                268.0
ii)      2              106.1                278.6
iii)     3              105.6                275.0
iv)      4              106.3                281.5
v)       5              101.7                232.4
vi)      6              104.4                272.2
vii)     7              102.0                227.5
viii)    8              103.8                255.1
ix)      9              104.0                259.5
x)       10             101.5                229.0
xi)      11             101.9                233.8
xii)     12             100.6                205.9
xiii)    13             104.9                272.0
xiv)     14             106.2                280.3
xv)      15             103.1                242.2

5.2.4.2 From the data in Table 1, various computations are obtained as follows:

$\sum x = 1556.3$, $\sum y = 3813.0$, $\sum x^2 = 161520.87$, $\sum xy = 396226.39$, $\bar{x} = 103.75$, $\bar{y} = 254.2$

Corrected sum of squares for x $= 161520.87 - [(1556.3)^2/15] = 161520.87 - 161471.31 = 49.56$

Corrected sum of products $= 396226.39 - [(1556.3 \times 3813.0)/15] = 396226.39 - 395611.46 = 614.93$

$b = 614.93/49.56 = 12.4$

$a = \bar{y} - b\bar{x} = 254.2 - 1286.5 = -1032.3$

Hence the regression line is obtained as $y = -1032.3 + 12.4x$.

5.2.4.3 For simplifying the computational work involved in fitting a regression line, change of origin is often helpful in one or both the variables. Thus, for the example worked out in 5.2.4.2, if the variables x and y are changed to u and v such that $u = x - 100$ and $v = y - 250$, then the computations would be as follows:

$\sum u = 56.3$, $\bar{u} = 3.75$, $\sum v = 63.0$, $\bar{v} = 4.20$, $\sum u^2 = 260.87$, $\sum uv = 851.39$

$\sum (u - \bar{u})^2 = \sum u^2 - [(\sum u)^2/n] = 260.87 - 211.31 = 49.56$

$\sum (u - \bar{u})(v - \bar{v}) = \sum uv - [(\sum u)(\sum v)/n] = 851.39 - 236.46 = 614.93$

$b = 614.93/49.56 = 12.4$ and $a = \bar{v} - b\bar{u} = 4.2 - 46.5 = -42.3$

Hence the regression line is obtained as $v = -42.3 + 12.4u$, which when transformed to the original variables comes out as:

$(y - 250) = -42.3 + 12.4(x - 100)$, that is, $y = -1032.3 + 12.4x$

NOTE -- It would be of interest to observe that the regression coefficient b is not affected by the change of origin of either or both the variables.

FIG. 2 SCATTER DIAGRAM ALONGWITH THE REGRESSION LINE [x-axis: Brinell Hardness No. (x), 100 to 108; y-axis: tensile strength]

5.2.4.4 From this equation the expected value of tensile strength for any given Brinell hardness number could be obtained.
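The computations of 5.2.4.2 to 5.2.4.4 can be cross-checked with a short script. What follows is a minimal sketch in Python and is not part of this standard; the function name `fit_line` and the variable names are illustrative assumptions. It applies the corrected sums of 5.2.1 to the Table 1 data.

```python
# Illustrative sketch only: least-squares line y = a + b*x from the
# corrected sums of 5.2.1. The name fit_line is hypothetical.

def fit_line(x, y):
    """Return (a, b, sxx, sxy) for the regression line of y on x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sxx = sum(v * v for v in x) - sum_x ** 2 / n               # corrected SS for x
    sxy = sum(p * q for p, q in zip(x, y)) - sum_x * sum_y / n  # corrected SP
    b = sxy / sxx                                               # regression coefficient
    a = sum_y / n - b * sum_x / n                               # constant term
    return a, b, sxx, sxy

# Table 1: Brinell hardness (x) and tensile strength in MPa (y)
hardness = [104.2, 106.1, 105.6, 106.3, 101.7, 104.4, 102.0, 103.8,
            104.0, 101.5, 101.9, 100.6, 104.9, 106.2, 103.1]
strength = [268.0, 278.6, 275.0, 281.5, 232.4, 272.2, 227.5, 255.1,
            259.5, 229.0, 233.8, 205.9, 272.0, 280.3, 242.2]

a, b, sxx, sxy = fit_line(hardness, strength)
predicted = a + b * 105  # expected tensile strength at hardness 105

# Change of origin as in 5.2.4.3: the slope must be unchanged.
_, b_shifted, _, _ = fit_line([v - 100 for v in hardness],
                              [v - 250 for v in strength])

print(f"Sxx = {sxx:.2f}, Sxy = {sxy:.2f}, b = {b:.2f}")
print(f"expected strength at hardness 105: {predicted:.1f} MPa")
```

Because the script keeps b unrounded, its constant term differs slightly from the value of -1032.3 obtained in 5.2.4.2 with b rounded to 12.4; the predicted value at a hardness of 105 still agrees to one decimal place.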
Thus, when the hardness number is known to be 105, the corresponding expected value of tensile strength would be 269.7 megapascals.

5.2.5 Construction of Confidence Limits for the Regression Line

The model is $y = \alpha + \beta x + \text{error}$.

The estimates b of $\beta$ and a of $\alpha$ are obtained for the example as: $a = -1032.3$, $b = 12.4$.

The error mean square ($\hat{\sigma}^2_{y.x}$) is given by:

$\hat{\sigma}^2_{y.x} = \sum (y - a - bx)^2/(15 - 2) = 30.813$

For a particular value of X = x (Brinell hardness = x), the predicted value of the tensile strength ($\hat{y}$) is:

$\hat{y} = -1032.3 + 12.4x$

The standard deviation of $\hat{y}$ given X = x is:

$s(\hat{y}) = \hat{\sigma}_{y.x}\,[(1/n) + \{(x - \bar{x})^2/\sum (x - \bar{x})^2\}]^{1/2}$

Therefore, for a given x the confidence limits on the value of y are $\hat{y} \pm t\,s(\hat{y})$, where t is the value of a t-distribution with (n - 2 = 13) degrees of freedom. Since we are interested in the confidence limits for the whole of the regression line, these limits for individual $\hat{y}$ have to be relaxed. The appropriate multiplier is $(2F)^{1/2}$, where F is the upper 5 percent point of the F distribution with degrees of freedom $\nu_1 = 2$ and $\nu_2 = (n - 2) = (15 - 2) = 13$. From the tables of F, the value of F (2, 13) at 5 percent level of significance is 3.80. So the multiplier is $(2 \times 3.8)^{1/2} = 2.76$.

So the confidence limits for the regression line are given by:

$a + bx + 2.76\,[\sqrt{30.813}\,\{(1/15) + (x - \bar{x})^2/49.56\}^{1/2}]$ and

$a + bx - 2.76\,[\sqrt{30.813}\,\{(1/15) + (x - \bar{x})^2/49.56\}^{1/2}]$

The confidence limits of the regression line for the data given in Table 1 have been calculated from the above expressions and are given in Table 2.

Table 2 Confidence Limits for Regression Line
(Clause 5.2.5)

x        ŷ        Upper Limit   Lower Limit
(1)      (2)      (3)           (4)
100.6    215.14   223.05        207.23
101.5    226.30   232.59        220.01
102.0    232.50   237.99        227.01
103.1    246.14   250.36        241.91
103.8    254.82   258.78        250.85
104.2    259.78   263.86        255.70
104.9    268.46   273.14        263.78
105.6    277.14   282.78        271.50
106.3    285.82   292.63        279.01

NOTE -- The upper and lower limits for the regression line form a hyperbolic curve. When x is close to $\bar{x}$, the contribution of the term $(x - \bar{x})^2$ is small. As x deviates from $\bar{x}$, the contribution of this term increases.

5.3 Method of Calculation (Grouped Data)

5.3.1 Sometimes, the observations on the two variables x and y are presented in the form of a frequency table. In such situations the range of each variate is divided into a number of class intervals of equal width (say $l_x$ for p classes of independent variable x and $l_y$ for q classes of dependent variable y). The class widths for x and y need not be equal to each other, and the frequency $f_{x_i y_j}$ in the cell is determined by the ith class interval of the first variate and the jth class interval of the second variate. This would result in a bivariate frequency distribution table (see Annex B).

5.3.2 As a first step for calculating the regression line, another proforma (see Annex C) is to be prepared.

5.3.3 The different entries in the above proforma are explained below:

a) In the top row are given the mid-values of the class intervals for the independent variable x whereas in the first column are given the mid-values of the class intervals for the dependent variable y. In the column $f_y$ are given the total frequencies of the corresponding rows whereas in the row corresponding to $f_x$ are given the totals of the corresponding frequencies in the various columns.
b) In the row corresponding to u are given the transformed variables for x, which are obtained by subtracting an arbitrary quantity $x_0$ (preferably the value of x closest to the median) from each of the mid-values of the class intervals for the x variate and dividing these differences by the width of the class intervals for the x variate. That is, $u = (x - x_0)/l_x$, where $l_x$ is the width of the class interval for x. A similar transformed variable v is given for the variate y in the respective column: $v = (y - y_0)/l_y$.
c) The next two rows, namely, $uf_x$ and $u^2f_x$, are self-explanatory. So also the two columns corresponding to $vf_y$ and $v^2f_y$.
d) The row corresponding to V is obtained as the sum of the products of v and the corresponding frequency in the column. So also the column corresponding to U consists of entries obtained as the sum of the products of u with the corresponding frequency in the row.
e) The last row uV as also the last column vU are self-explanatory.

5.3.4 To ensure the correctness of computations it would be necessary to verify the following checks from the above proforma:

a) The total frequency of the row corresponding to $f_x$ should be equal to the total frequency of the column corresponding to $f_y$.
b) The total of the row corresponding to V should be equal to the total of the column corresponding to $vf_y$.
c) The total of the column corresponding to U should be equal to the total of the row corresponding to $uf_x$.
d) The sum of the last row corresponding to uV should be equal to the sum of the last column corresponding to vU.

5.3.5 After the computations using the above proforma, the regression coefficient is calculated by the following formula:

$b = \dfrac{[\sum uV - \{(\sum uf_x)(\sum vf_y)/n\}]\; l_y}{[\sum u^2f_x - \{(\sum uf_x)^2/n\}]\; l_x}$

The constant of the regression line is obtained as:

$a = \bar{y} - b\bar{x} = \{y_0 + l_y(\sum vf_y/n)\} - b\,\{x_0 + l_x(\sum uf_x/n)\}$

5.3.6 Example

Table 3 gives the distribution of 82 small and medium size sugar factories by the quantity of cane crushed (x) and the quantity of sugar produced (y). Fit a regression line.

Table 3 Frequency Distribution of Sugar Factories by Cane Crushed and Sugar Produced
(Clause 5.3.6)
[Bivariate frequency table for 82 factories. Columns: cane crushed in 1 000 tonne (x), classes 30-49, 50-69, 70-89, 90-109, 110-129, 130-149, 150-169, 170-189, 190-209 and 210-229. Rows: sugar produced in 1 000 tonne (y), classes 3.0-4.9, 5.0-6.9, 7.0-8.9, 9.0-10.9, 11.0-12.9, 13.0-14.9, 15.0-16.9, 17.0-18.9, 19.0-20.9 and 21.0-22.9. The cell frequencies are not legible in this copy.]

As a first step, the computations are made in Table 4.

Table 4 Proforma for Computation of Regression Line for the Data Given in Table 3
(Clause 5.3.6)
[Proforma of Annex C with class mid-values x = 40, 60, ..., 220 and y = 4.0, 6.0, ..., 22.0, using $x_0 = 140$, $l_x = 20$, $y_0 = 12$ and $l_y = 2$. The cell entries are not legible in this copy; the totals used in the computations below are n = 82, $\sum uf_x = -78$, $\sum vf_y = -35$, $\sum u^2f_x = 410$, $\sum v^2f_y = 313$ and $\sum uV = 329$.]

The regression coefficient is computed as:

$b = \dfrac{[329 - \{(-78)(-35)/82\}] \times 2}{[410 - \{(-78)^2/82\}] \times 20} = 0.0881 \approx 0.09$

The constant of the regression line is obtained as:

$a = \{12 - (2 \times 35/82)\} - 0.0881\,\{140 - (20 \times 78/82)\} = 11.15 - 0.0881 \times 120.98 = 11.15 - 10.66 = 0.49$

Hence the regression line is obtained as:

$y = 0.49 + 0.09x$

This line can be used for predicting the quantity of sugar produced knowing the quantity of the cane crushed. Thus, if 100 thousand tonne of cane is crushed (x = 100), then by the above equation about 9.5 thousand tonne of sugar can be expected to be produced.

NOTE -- For the purpose of prediction, the regression line may be used only within the range of the independent variable and in the vicinity of the terminal values.

5.4 Testing for Regression Coefficient

5.4.1 From the way the regression coefficient is calculated, it is obvious that its value depends on the sample observations. Hence if a new set of observations is obtained from the same population and the corresponding regression coefficient is calculated, it may not necessarily be the same as the earlier one. Because of this fluctuation it is necessary to test whether the regression coefficient as calculated from the sample observations differs significantly from some specified value which may correspond to the entire population. Sometimes, the specified value may also be a rounded off value which seems more feasible as the population regression coefficient. However, for any testing of the regression coefficient to be valid, it is assumed that both the independent and the dependent variables follow a normal distribution. For further details of the normal distribution, see IS 9300 (Part 1).

5.4.2 Ungrouped Data

The value of the regression coefficient as calculated from the sample is an estimate of the true regression coefficient for the entire population. To judge whether the population regression coefficient differs significantly from a specified value $\beta_0$, the null hypothesis $H_0: \beta = \beta_0$ is tested against the alternative hypothesis $H_1: \beta \neq \beta_0$ by computing the following test-statistic:

$t = \dfrac{(b - \beta_0)\,\{\sum (x - \bar{x})^2\}^{1/2}}{\left[\{\sum (y - \bar{y})^2 - b \sum (x - \bar{x})(y - \bar{y})\}/(n - 2)\right]^{1/2}}$

where

b = regression coefficient computed from the data,
$\beta_0$ = specified value of the regression coefficient,
$\sum (x - \bar{x})^2$ = corrected sum of squares for x,
$\sum (y - \bar{y})^2$ = corrected sum of squares for y,
$\sum (x - \bar{x})(y - \bar{y})$ = corrected sum of products, and
n = sample size.

The calculated value of t shall be compared with the tabulated value of t [see Annex B of IS 6200 (Part 1)] at the desired level of significance (normally 5 percent) and for (n - 2) degrees of freedom. If the absolute value of the calculated t is greater than or equal to the tabulated value, the null hypothesis is rejected and the alternative hypothesis that the population regression coefficient is significantly different from the specified value $\beta_0$ is accepted, otherwise not. As a particular case, when $\beta_0 = 0$ the test would be used to verify the assumption that the change in the independent variable does not affect the dependent variable in the population in a systematic manner.

5.4.2.1 Example

In the illustration given in 5.2.4 concerning the regression equation for predicting the tensile strength from Brinell hardness number for cold drawn copper, it may be of interest to test whether the population regression coefficient is significantly different from the specified value of 11.0, which was found to hold good in an earlier investigation done on a large number of samples. In this case, $H_0: \beta = 11.0$ and $H_1: \beta \neq 11.0$.

The t-statistic is calculated as follows:

$t = (12.4 - 11.0)(49.56)^{1/2}/[\{8030.90 - (12.4 \times 614.93)\}/13]^{1/2} = 1.4 \times 7.04/5.59 = 1.76$

The tabulated value of t at 5 percent level of significance and for 13 degrees of freedom is given as 2.160. Since the calculated value is less than 2.160, $H_0$ is not rejected and it is concluded that the population regression coefficient is not significantly different from 11.0.

5.4.3 Grouped Data

In the case of grouped data, for testing whether the population regression coefficient is significantly different from the specified value $\beta_0$, the null hypothesis $H_0: \beta = \beta_0$ is tested against the alternative hypothesis $H_1: \beta \neq \beta_0$ and the t-statistic is computed as:

$t = A/\sqrt{B}$

where

$A = (b - \beta_0)\,[\sum u^2f_x - \{(\sum uf_x)^2/n\}]^{1/2}\; l_x$

$B = \dfrac{[\sum v^2f_y - \{(\sum vf_y)^2/n\}]\; l_y^2 - b\,[\sum uV - \{(\sum uf_x)(\sum vf_y)/n\}]\; l_x l_y}{n - 2}$

The testing can be done on the same lines as indicated in 5.4.2.

5.4.3.1 In the example given in 5.3.6 for predicting the quantity of sugar produced from the quantity of cane crushed, it may be of interest to examine whether the population regression coefficient is significantly different from zero. In this case, $H_0: \beta = 0$; $H_1: \beta \neq 0$.

The t-statistic is computed as follows:

$A = 0.09\,[410 - \{(-78)^2/82\}]^{1/2} \times 20 = 32.9850$

$B = [\{313 - (-35)^2/82\} \times 4 - 0.09\,\{329 - (-78)(-35)/82\} \times 20 \times 2]/80 = 1.5962$

$t = 32.9850/\sqrt{1.5962} = 32.9850/1.2634 = 26.1$

Since the calculated value exceeds the tabulated value of t at 5 percent level of significance and 80 degrees of freedom (about 1.99), $H_0$ is rejected and it is concluded that the population regression coefficient is significantly different from zero.

6 CORRELATION

6.1 Correlation Coefficient

6.1.1 The correlation coefficient is usually denoted by the symbol $\rho$ with respect to the population under study. When the study is based on a sample drawn from a population, it is denoted by the symbol r. Values of the correlation coefficient lie between -1 and +1. If it is +1, perfect positive correlation exists between the two factors under study, which implies that a definite increase in one factor is accompanied by a proportionate increase in the other factor.

6.1.2 If r = -1 then perfect negative correlation is present, meaning thereby that a definite increase in one factor is followed by a proportionate decrease in the other factor, or vice versa. If the correlation coefficient is zero then the two factors are said to be uncorrelated. The correlation coefficient is a pure number and its magnitude is unaffected by the scale in which the two variables x and y are measured.

6.1.3 Figure 3 gives the scatter diagram for the two variables x and y in three situations, namely, when the correlation coefficient is high positive (say, 0.9), zero and high negative (say, -0.9).

FIG. 3 SCATTER DIAGRAM [three panels, each plotting y against x: High Positive Correlation, No Correlation, High Negative Correlation]

6.2 Method of Calculation (Ungrouped Data)

6.2.1 Let there be n paired observations x and y corresponding to the items in the sample. The average of x, average of y, corrected sum of squares for x, corrected sum of squares for y and corrected sum of products are then calculated as given in 5.2.1.

6.2.2 From the above quantities the correlation coefficient is calculated as follows:

$r = \dfrac{\text{Corrected sum of products}}{[(\text{Corrected sum of squares for } x) \times (\text{Corrected sum of squares for } y)]^{1/2}}$

6.2.3 When the status of the two variables is not known (see 5.1.1), the two regression coefficients obtained in fitting the lines $y = a + bx$ and $x = a' + b'y$ and the correlation coefficient r are related as $r^2 = b\,b'$, with r taking the same sign as b and b'.

6.2.4 Example

An investigation was carried out on 4-litre paint tins for finding the correlation between the capacity as calculated from the base dimensions and height and the actual measured capacity, with a view to reducing the testing for the latter characteristic, which was more time consuming as compared to dimensional checking. Table 5 gives the data obtained on 35 such tins.

Table 5 Capacities of 4-Litre Paint Tins
(Clause 6.2.4)

Sl No. of Tin   Calculated Capacity x   Measured Capacity y
(1)             (2)                     (3)
1               4.732                   4.530
2               4.735                   4.540
3               4.756                   4.550
4               4.709                   4.540
5               4.708                   4.540
6               4.768                   4.500
7               4.726                   4.490
8               4.744                   4.510
9               4.686                   4.485
10              4.693                   4.495
11              4.695                   4.480
12              4.694                   4.485
13              4.692                   4.485
14              4.727                   4.490
15              4.729                   4.490
16              4.745                   4.500
17              4.741                   4.500
18              4.704                   4.510
19              4.741                   4.510
20              4.745                   4.515
21              4.771                   4.520
22              4.774                   4.515
23              4.768                   4.510
24              4.758                   4.525
25              4.772                   4.520
26              4.779                   4.550
27              4.763                   4.550
28              4.757                   4.540
29              4.781                   4.550
30              4.784                   4.555
31              4.758                   4.515
32              4.753                   4.520
33              4.753                   4.525
34              4.732                   4.525
35              4.757                   4.530

From the data tabulated in Table 5 the various computations are obtained as follows:

$\sum x = 165.930$, $\bar{x} = 4.741$
$\sum y = 158.095$, $\bar{y} = 4.517$
$\sum x^2 = 786.678154$
$\sum y^2 = 714.131775$
$\sum xy = 749.518555$
Corrected sum of squares for x = 0.027728
Corrected sum of squares for y = 0.016660
Corrected sum of products = 0.012745

Hence the correlation coefficient is equal to:

$r = 0.012745/(0.027728 \times 0.016660)^{1/2} = 0.59$

NOTE -- It may be of interest to observe that the correlation coefficient r is not affected by the change of origin and scale for either or both the variables. Hence the computations given in the above example can be simplified considerably by making the transformations as follows:

$u = (x - 4.700) \times 1000$
$v = (y - 4.500) \times 1000$
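The calculation of 6.2.2 and the invariance stated in the NOTE to 6.2.4 can be checked numerically. The following Python sketch is illustrative only and not part of this standard; the function name `correlation` and the list names are assumptions made for the example. It computes r from the Table 5 data, first on the original values and then on the coded values u and v.

```python
import math

# Illustrative sketch only: sample correlation coefficient from the
# corrected sums of 5.2.1. The name correlation is hypothetical.

def correlation(x, y):
    """Return r = Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sxx = sum(v * v for v in x) - sum_x ** 2 / n
    syy = sum(v * v for v in y) - sum_y ** 2 / n
    sxy = sum(p * q for p, q in zip(x, y)) - sum_x * sum_y / n
    return sxy / math.sqrt(sxx * syy)

# Table 5: calculated capacity (x) and measured capacity (y) of 35 tins
calculated = [4.732, 4.735, 4.756, 4.709, 4.708, 4.768, 4.726, 4.744, 4.686,
              4.693, 4.695, 4.694, 4.692, 4.727, 4.729, 4.745, 4.741, 4.704,
              4.741, 4.745, 4.771, 4.774, 4.768, 4.758, 4.772, 4.779, 4.763,
              4.757, 4.781, 4.784, 4.758, 4.753, 4.753, 4.732, 4.757]
measured = [4.530, 4.540, 4.550, 4.540, 4.540, 4.500, 4.490, 4.510, 4.485,
            4.495, 4.480, 4.485, 4.485, 4.490, 4.490, 4.500, 4.500, 4.510,
            4.510, 4.515, 4.520, 4.515, 4.510, 4.525, 4.520, 4.550, 4.550,
            4.540, 4.550, 4.555, 4.515, 4.520, 4.525, 4.525, 4.530]

r = correlation(calculated, measured)

# Change of origin and scale as in the NOTE to 6.2.4
u = [(v - 4.700) * 1000 for v in calculated]
w = [(v - 4.500) * 1000 for v in measured]
r_coded = correlation(u, w)

print(f"r = {r:.2f}, r from coded values = {r_coded:.2f}")
```

Working with the coded values keeps the sums small, which is convenient for hand computation; the script shows that both routes lead to the same coefficient.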
6.3 Method of Calculation (Grouped Data)

6.3.1 If the observations on the two variables x and y are presented in the form of a frequency table in which the range of each variate is divided into a number of class intervals and the frequency $f_{x_i y_j}$ corresponds to the cell determined by the ith class interval of the first variate and the jth class interval of the second variate, then the initial computations for obtaining the correlation coefficient would be exactly the same as given in 5.3.1 and 5.3.2.

6.3.2 After the necessary tabulation of initial computations (see 5.3.2) is made, the correlation coefficient is obtained by the following formula:

$r = \dfrac{\sum uV - \{(\sum uf_x)(\sum vf_y)/n\}}{\left[\{\sum u^2f_x - (\sum uf_x)^2/n\}\,\{\sum v^2f_y - (\sum vf_y)^2/n\}\right]^{1/2}}$

6.3.3 Example

Table 6 gives the distribution of 100 casts of steel by the percentage of iron in the form of pig iron (x) and the lime consumption in quintal per cast (y).

Table 6 Frequency Distribution of Percentage of Pig Iron (x) and Lime Consumption (y)
(Clause 6.3.3)
[Bivariate frequency table for 100 casts. Columns: percentage of pig iron (x), classes 20-24, 25-29, 30-34, 35-39, 40-44, 45-49 and 50-54. Rows: lime consumption in quintal per cast (y), classes 100-124, 125-149, 150-174, 175-199, 200-224, 225-249, 250-274, 275-299 and 300-324. The cell frequencies are not legible in this copy.]

As a first step the computations as in Table 7 are made.

Table 7 Computations
(Clause 6.3.3)
[Proforma of Annex C with class mid-values x = 22, 27, 32, 37, 42, 47, 52 and y = 112, 137, 162, 187, 212, 237, 262, 287, 312. The cell entries are not legible in this copy; the totals used in the computation below are n = 100, $\sum uf_x = 76$, $\sum vf_y = -88$, $\sum u^2f_x = 254$, $\sum v^2f_y = 264$ and $\sum uV = 16$.]

The correlation coefficient r is calculated as:

$r = \dfrac{16 - \{(76)(-88)/100\}}{[\{254 - (76)^2/100\}\,\{264 - (-88)^2/100\}]^{1/2}} = \dfrac{16 + 66.88}{[(254 - 57.76)(264 - 77.44)]^{1/2}} = 0.43$

6.4 Testing for Correlation Coefficient

6.4.1 The correlation coefficient as calculated from the sample data is the estimate of the correlation coefficient applicable to all the items in the population. It may, however, sometimes be necessary to test whether the population correlation coefficient differs significantly from a specified value $\rho_0$. The corresponding tests to be performed when $\rho_0$ is equal to zero and when $\rho_0$ is a non-zero value are slightly different and are given below.

6.4.2 To judge whether the population correlation coefficient differs significantly from zero (that is, $\rho_0 = 0$), the null hypothesis $H_0: \rho = 0$ is tested against the alternative hypothesis $H_1: \rho \neq 0$ by computing the following statistic:

$t = r\,(n - 2)^{1/2}/(1 - r^2)^{1/2}$

where r is the correlation coefficient as computed from the sample and n is the sample size.

The value of t so calculated shall be compared with the tabulated value of t [see Annex B of IS 6200 (Part 1)] at the desired level of significance (normally 5 percent) and for (n - 2) degrees of freedom. If the absolute value of the calculated t is greater than or equal to the tabulated value, $H_0$ is rejected and the population correlation coefficient is said to be significantly different from zero, meaning thereby that the two factors under consideration are correlated. However, if the calculated value of t is less than the tabulated value, $H_0$ is not rejected and it indicates that the sample data does not show any evidence that the factors under consideration are correlated.

For some selected values of sample size n, the values of r corresponding to the critical values of t at 5 percent and 1 percent level of significance have been calculated and are given in Annex D. If the calculated value of the correlation coefficient is less than the tabulated value, the null hypothesis is not rejected; otherwise it is rejected.

6.4.2.1 Example

In the illustration given in 6.3.3, wherein the correlation coefficient between the percentage of pig iron and lime consumption in quintal per cast was computed as 0.43, if it is intended to test whether the population correlation coefficient is significantly different from zero, the null hypothesis is $H_0: \rho = 0$ and the alternative hypothesis is $H_1: \rho \neq 0$. The t-statistic is computed as:

$t = (0.43)\sqrt{98}/(0.8151)^{1/2} = 4.71$

Since the calculated value exceeds the tabulated value of the t distribution with 98 degrees of freedom at 5 percent level of significance (about 1.98), $H_0$ is rejected and it is concluded that the population correlation coefficient is significantly different from zero, that is, the variables are associated to a significant extent.

6.4.3 For judging whether the population correlation coefficient differs significantly from the specified value $\rho_0$ (other than zero), the null hypothesis $H_0: \rho = \rho_0$ shall be tested against the alternative hypothesis $H_1: \rho \neq \rho_0$. The sample correlation coefficient r and the specified value $\rho_0$ shall be transformed into z and $z_0$ with the help of Annex E and the following statistic shall be computed:

$|z - z_0|\,(n - 3)^{1/2}$

where $|z - z_0|$ denotes the value of the difference between z and $z_0$ ignoring the sign. If the value of this statistic is less than or equal to 1.96 (corresponding to 5 percent level of significance of the normal deviate), then $H_0$ is not rejected and it indicates that the population correlation coefficient is not significantly different from $\rho_0$.
In case the calculated value of the normal deviate is more than 1.96, HOis rejected and it indicates that the population correlation coefficient is significantly different from the specified value of pO. NOTE -- If the level of significance chosen is 1 percent then instead of 1.96 the value comparison. 2.58 is to be used in the above 6.4.3.1 Example In the illustration given under 6.2.4 wherein the correlation coefficient between the calculated capacity and measured capacity of 4-litre paint tins was computed as 0.59, it may be of interest to test whether the population correlation coeftlcient is significantly different from 0.70. In this case, null hypothesis is EIO: .p = pO and the alternative hypothesis is Jfl : p # pO. From Annex E, the value of z corresponding to r = 0.59 is obtained as 0.677 7 and that of ZO corresponding to pO= 0.70 is obtained as 0.8673. Hence I z -ZO\(n-3)1n=10.6777 0.1896 x 5.66= 1.073 -0.867 31@= Since this value is less than 1.96, IfO is not rejected and there is not enough evidence to conclude that population correlation coefficient is significantly different from 0.70. ANNEX A (Clause 5.2.1) PROFORMA FOR COMPUTATION OF CORRELATION/REGRESSION Product: Independent variable (x): Dependent variable (y): Unit of measurement: Unit of measurement: S1 No. x Y ~=~ x­x lx ~= Y­Yo 1, U2 (1) (2) (3) (4) (5) (6) (7) (8) Total Mean NOTE -- In case the variables x andy are not transfomred to u and v respectively, COI6,7 and 8 may be utilized for tabulating F, ~ and xy. 11 ANNEX B (Clause 5.3.1) BIVARIATE FREQUENCY DISTRIBUTION TABLE "w x. Yj j=l p classes of equal width for x-variable i=l 2 3 . . .. . .. i ... ... ... .. . P Total 2 3 ~iyj 9 n ~iyj n = frequency in the (i,j) cell. = total frequency row. NOTE -- x, is the mid-point of the interval ith column and y, is the mid-point of intervaljth ANNEX C (Clause 5.3.2) FOR CALCULATING REGRESSION LINE FOR GROUPED DATA v_ Y­Yo 1, VfY $& u vu n x­x A ~=--.--L lx ufx u2fx . 
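The two tests of 6.4.2 and 6.4.3, which Annex D and Annex E support, each reduce to a single formula. The following minimal Python sketch is given for illustration only and is not part of the standard. The function names are hypothetical; `math.atanh` is used in place of the table look-up in Annex E; and the sample size n = 35 for the example of 6.4.3.1 is an assumption inferred from the factor 5.66 quoted there for $(n-3)^{1/2}$.

```python
from math import atanh, sqrt


def t_statistic(r, n):
    """Statistic of 6.4.2 for testing rho = 0: t = r * sqrt(n - 2) / sqrt(1 - r^2)."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)


def z_statistic(r, rho0, n):
    """Statistic of 6.4.3 for testing rho = rho0: |z - z0| * sqrt(n - 3).

    z and z0 are the transformed values z = atanh(r) and z0 = atanh(rho0),
    which is the transformation tabulated in Annex E.
    """
    return abs(atanh(r) - atanh(rho0)) * sqrt(n - 3)


# Example of 6.4.2.1: r = 0.43 from n = 100 casts.
t = t_statistic(0.43, 100)
print(round(t, 2))  # 4.71

# Example of 6.4.3.1: r = 0.59, rho0 = 0.70; n = 35 is inferred (assumption).
z = z_statistic(0.59, 0.70, 35)
print(round(z, 3))  # 1.073

# Decisions at the 5 percent level. 1.96 is the normal deviate used in 6.4.3;
# 6.4.2.1 takes the tabulated t for 98 degrees of freedom as close to it.
print("6.4.2.1: reject H0" if t >= 1.96 else "6.4.2.1: do not reject H0")
print("6.4.3.1: reject H0" if z > 1.96 else "6.4.3.1: do not reject H0")
```

For small samples the tabulated value of t for (n - 2) degrees of freedom should be passed in place of 1.96 in the first comparison, as 6.4.2 requires.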
ANNEX D
(Clause 6.4.2)
TABULATED VALUES OF r FOR 5 PERCENT AND 1 PERCENT LEVEL OF SIGNIFICANCE

         Calculated Values of r
 n       5 percent Level     1 percent Level
         of Significance     of Significance
 3       1.0000              1.0000
 4       1.0000              1.0000
 5       0.9540              0.9857
 6       0.8910              0.9559
 7       0.8294              0.9188
 8       0.7743              0.8801
 9       0.7266              0.8426
10       0.6855              0.8076
11       0.6498              0.7755
12       0.6188              0.7461
13       0.5915              0.7193
14       0.5674              0.6948
15       0.5457              0.6723
16       0.5265              0.6518
17       0.5088              0.6329
18       0.4930              0.6154
19       0.4784              0.5991
20       0.4650              0.5840
21       0.4526              0.5701
22       0.4412              0.5569
23       0.4307              0.5447
24       0.4207              0.5332
25       0.4115              0.5223
26       0.4028              0.5122
27       0.3947              0.5024
28       0.3870              0.4934
29       0.3797              0.4847
30       0.3727              0.4764

ANNEX E
(Clause 6.4.3)
THE Z-TRANSFORMATION OF THE CORRELATION COEFFICIENT ($z = \tanh^{-1} r$)

(The column heading gives the first decimal of r and the row heading gives the second decimal.)

 r      0        0.1      0.2      0.3      0.4      0.5      0.6      0.7      0.8     0.9
0.00    0.0000   0.1003   0.2027   0.3095   0.4236   0.5493   0.6931   0.8673   1.099   1.472
0.01    0.0100   0.1104   0.2132   0.3205   0.4356   0.5627   0.7089   0.8872   1.127   1.528
0.02    0.0200   0.1206   0.2237   0.3316   0.4477   0.5763   0.7250   0.9076   1.157   1.589
0.03    0.0300   0.1307   0.2342   0.3428   0.4599   0.5901   0.7414   0.9287   1.188   1.658
0.04    0.0400   0.1409   0.2448   0.3541   0.4722   0.6042   0.7582   0.9505   1.221   1.738
0.05    0.0500   0.1511   0.2554   0.3654   0.4847   0.6184   0.7753   0.973    1.256   1.832
0.06    0.0601   0.1614   0.2661   0.3769   0.4973   0.6328   0.7928   0.996    1.293   1.946
0.07    0.0701   0.1717   0.2769   0.3884   0.5101   0.6475   0.8107   1.020    1.333   2.092
0.08    0.0802   0.1820   0.2877   0.4001   0.5230   0.6625   0.8291   1.045    1.376   2.298
0.09    0.0902   0.1923   0.2986   0.4118   0.5361   0.6777   0.8480   1.071    1.422   2.647

ANNEX F
(Foreword)
COMMITTEE COMPOSITION

Statistical Methods for Quality and Reliability Sectional Committee, MSD 3

Organization
    Representative(s)

Kolkata University, Kolkata
    PROF S. P. MUKHERJEE (Chairman)
Bharat Heavy Electricals Limited, Hyderabad
    SHRI S. N. JHA
    SHRI A. V. KRISHNAN (Alternate)
Continental Devices India Ltd, New Delhi
    DR NAVIN KAPUR
    SHRI VIPUL GUPTA (Alternate)
Directorate General of Quality Assurance, New Delhi
    SHRI S. K. SRIVASTVA
    LT-COL P. VIJAYAN (Alternate)
Laser Science and Technology Centre, DRDO, New Delhi
    DR ASHOK KUMAR
Escorts Limited, Faridabad
    SHRI C. S. V. NARENDRA
HMT Ltd, R & D Centre, Bangalore
    SHRI K. VIJAYAMMA
Indian Agricultural Statistics Research Institute (IASRI), New Delhi
    DR S. D. SHARMA
    DR A. K. SRIVASTAVA (Alternate)
Indian Association for Productivity, Quality & Reliability (IAPQR), Kolkata
    DR B. DAS
Indian Institute of Management (IIM), Lucknow
    PROF S. CHAKRABORTY
Indian Statistical Institute (ISI), Kolkata
    PROF S. R. MOHAN
    PROF ARVIND SETH (Alternate)
National Institution for Quality and Reliability (NIQR), New Delhi
    SHRI Y. K. BHAT
    SHRI G. W. DATEY (Alternate)
Powergrid Corporation of India Ltd, New Delhi
    DR S. K. AGARWAL
    SHRI D. CHAKRABORTY (Alternate)
SRF Limited, Chennai
    SHRI A. SANJEEVA RAO
    SHRI C. DESIGAN (Alternate)
Standardization, Testing and Quality Certification Directorate (STQCD), New Delhi
    SHRI S. K. KIMOTHI
    SHRI P. N. SRIKANTH (Alternate)
Tata Engineering and Locomotive Co Ltd (TELCO), Jamshedpur
    SHRI S. KUMAR
    SHRI SHANTI SARUP (Alternate)
University of Delhi, Delhi
    PROF M. C. AGRAWAL
In personal capacity (B-109, Malviya Nagar, New Delhi 110017)
    PROF A. N. NANKANA
In personal capacity (20/1, Krishna Nagar, Safdarjang Enclave, New Delhi 110029)
    SHRI D. R. SEN
BIS Directorate General
    SHRI P. K. GAMBHIR, Director & Head (MSD) [Representing Director General (Ex-officio)]

Member Secretary
    SHRI LALIT KUMAR MEHTA
    Deputy Director (MSD), BIS

Basic Statistical Methods Subcommittee, MSD 3:1

Kolkata University, Kolkata
    PROF S. P. MUKHERJEE (Convener)
Laser Science and Technology Centre, DRDO, New Delhi
    DR ASHOK KUMAR
Indian Agricultural Statistics Research Institute (IASRI), New Delhi
    DR S. D. SHARMA
    DR DEBABRATA RAY (Alternate)
Indian Association for Productivity, Quality and Reliability (IAPQR), Kolkata
    DR B. DAS
    DR A. LAHIRI (Alternate)
Indian Institute of Management (IIM), Lucknow
    PROF S. CHAKRABORTY
Indian Statistical Institute (ISI), Kolkata
    PROF S. R. MOHAN
National Institution for Quality and Reliability (NIQR), New Delhi
    SHRI Y. K. BHAT
    SHRI G. W. DATEY (Alternate)
Powergrid Corporation of India Ltd, New Delhi
    DR S. K. AGARWAL
Standardization, Testing and Quality Certification (STQC), New Delhi
    SHRI S. K. KIMOTHI
Tata Engineering and Locomotive Co Ltd (TELCO), Pune
    SHRI SHANTI SARUP
University College of Medical Sciences, Delhi
    DR A. INDRAYAN
University of Delhi, Delhi
    PROF M. C. AGRAWAL
In personal capacity (B-109, Malviya Nagar, New Delhi 110017)
    PROF A. N. NANKANA
In personal capacity (20/1, Krishna Nagar, Safdarjung Enclave, New Delhi 110029)
    SHRI D. R. SEN

Panel for Basic Methods Including Terminology, MSD 3:1/P-2

National Institute for Quality and Reliability (NIQR), New Delhi
    SHRI G. W. DATEY (Convener)
Laser Science and Technology Centre, DRDO, New Delhi
    DR ASHOK KUMAR
Indian Agricultural Statistics Research Institute (IASRI), New Delhi
    DR S. D. SHARMA
Indian Statistical Institute (ISI), New Delhi
    PROF S. R. MOHAN
National Institute for Quality and Reliability (NIQR), New Delhi
    SHRI Y. K. BHAT
Powergrid Corporation of India Ltd, New Delhi
    DR S. K. AGARWAL
In personal capacity (20/1, Krishna Nagar, Safdarjung Enclave, New Delhi 110029)
    SHRI D. R. SEN

Bureau of Indian Standards

BIS is a statutory institution established under the Bureau of Indian Standards Act, 1986 to promote harmonious development of the activities of standardization, marking and quality certification of goods and attending to connected matters in the country.

Copyright

BIS has the copyright of all its publications. No part of these publications may be reproduced in any form without the prior permission in writing of BIS.
This does not preclude the free use, in the course of implementing the standard, of necessary details, such as symbols and sizes, type or grade designations. Enquiries relating to copyright be addressed to the Director (Publications), BIS.

Review of Indian Standards

Amendments are issued to standards as the need arises on the basis of comments. Standards are also reviewed periodically; a standard along with amendments is reaffirmed when such review indicates that no changes are needed; if the review indicates that changes are needed, it is taken up for revision. Users of Indian Standards should ascertain that they are in possession of the latest amendments or edition by referring to the latest issue of 'BIS Catalogue' and 'Standards: Monthly Additions'.

This Indian Standard has been developed from Doc : No. MSD 3 (222).

Amendments Issued Since Publication

Amend No.        Date of Issue        Text Affected

BUREAU OF INDIAN STANDARDS

Headquarters:
Manak Bhavan, 9 Bahadur Shah Zafar Marg, New Delhi 110002
Telephones: 23230131, 23233375, 23239402
Telegrams: Manaksanstha (Common to all offices)

Regional Offices:
Central: Manak Bhavan, 9 Bahadur Shah Zafar Marg, NEW DELHI 110002
    Telephone: 23237617, 23233841
Eastern: 1/14 C.I.T. Scheme VII M, V.I.P. Road, Kankurgachi, KOLKATA 700054
    Telephone: 23378499, 23378561, 23378626, 23379120
Northern: SCO 335-336, Sector 34-A, CHANDIGARH 160022
    Telephone: 603843, 609285
Southern: C.I.T. Campus, IV Cross Road, CHENNAI 600113
    Telephone: 22541216, 22541442, 22542519, 22542315
Western: Manakalaya, E9 MIDC, Marol, Andheri (East), MUMBAI 400093
    Telephone: 28329295, 28327858, 28327891, 28327892

Branches: AHMEDABAD. BANGALORE. BHOPAL. BHUBANESHWAR. COIMBATORE. FARIDABAD. GHAZIABAD. GUWAHATI. HYDERABAD. JAIPUR. KANPUR. LUCKNOW. NAGPUR. NALAGARH. PATNA. PUNE. RAJKOT. THIRUVANANTHAPURAM. VISAKHAPATNAM.

Printed at Prabhat Offset Press, New Delhi-2